Skip to content

feat: Empty Block Complementing#343

Open
ravisoundar wants to merge 2 commits into
mainfrom
rs-complement
Open

feat: Empty Block Complementing#343
ravisoundar wants to merge 2 commits into
mainfrom
rs-complement

Conversation

@ravisoundar

Copy link
Copy Markdown
Collaborator

Description

Empty Block Complementing for Slurm Block Topology.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • All commits are signed off per DCO (git commit -s).

@ravisoundar ravisoundar requested a review from dmitsh as a code owner June 4, 2026 06:00
@copy-pr-bot

copy-pr-bot Bot commented Jun 4, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR implements Empty Block Complementing for Slurm block topology: when accelerator domains have more hosts than baseBlockSize, or when domains are unevenly sized, the emitted block list is padded with structural empty-block placeholders so Slurm can reason about the full tree capacity. The feature is activated on both the cluster-wide Slurm block path and the per-partition YAML path.

  • block_pack.go / block_complement.go: New files that build a partition-scoped ordered domain list, split oversized domains across multiple base blocks, pad each domain group to a power-of-two multiple, and expand the list to the next tree-capacity boundary. hostsForBlock intentionally filters to only the partition's own nodes, preventing cross-partition node leakage previously flagged in review.
  • block.go: After complement, refreshes nodeInfo.blockID so GetNodeTopologySpec always returns IDs that match the emitted topology file — fixes the stale-ID bug noted in earlier review threads.
  • slurm.go: Adds validateBlockSizes (called eagerly for both global and per-partition configs) and changes GetTranslateConfig to return *httperr.Error directly, avoiding double-wrapping and enabling a 400 Bad Request for malformed inputs.

Confidence Score: 5/5

Safe to merge. The complement logic is well-guarded, partition-scoped, and deterministic; all previously flagged issues have been addressed.

All three substantive issues from prior review rounds (stale blockID, nil dereference in initBlocks, cross-partition node contamination) are explicitly fixed and regression-tested. The new packing/complement code is thoroughly unit-tested across edge cases. The only remaining nit is a misleading error message wording in validateBlockSizes.

pkg/engines/slurm/slurm.go — the validateBlockSizes error messages are slightly inaccurate in their wording but do not affect correctness.

Important Files Changed

Filename Overview
pkg/translate/block_complement.go New file implementing the core complement logic: builds ordered domain lists scoped to partition-local blocks, computes group sizes, packs domains, adds tree-capacity padding, and decides whether to replace the input block list. Logic is correct and well-guarded.
pkg/translate/block_pack.go New file with domain packing helpers: hostsForBlock filters to partition-local nodes (fixing cross-partition contamination), splitHostsIntoBlocks, padDomainGroup, and ID assignment. All edge cases handled.
pkg/translate/block.go Now calls complementBlocks before emitting topology, then refreshes nodeInfo.blockID so GetNodeTopologySpec stays consistent with the emitted file. Also handles empty-node blocks correctly with the skeletonOnly
pkg/translate/yaml.go Threads complementBlocks into getBlockTopologyUnit for per-partition YAML path; correctly populates block names and parents map after complement. Empty block slots emit YAML entries with no nodes field (omitempty).
pkg/translate/topology.go Stores graph.Domains in nt.domains for complement use. Adds a nil guard on hostInfo lookup in initBlocks, fixing the previous nil-dereference risk.
pkg/engines/slurm/slurm.go Adds validateBlockSizes (called for both global and per-partition configs), changes GetTranslateConfig return type to *httperr.Error. Error messages in validateBlockSizes are slightly misleading (see comment).
pkg/topology/domain.go Replaces map[string]string DomainMap value with *HostInfo carrying Domain, InstanceID, and HostName, enabling the complement path to access instance metadata without a separate lookup.
pkg/engines/slinky/engine.go Propagates the GetTranslateConfig return-type change; now correctly forwards the *httperr.Error instead of re-wrapping it.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[toBlockTopology / getBlockTopologyUnit] --> B[complementBlocks]
    B --> C{fanoutsPerLevel OK\n& domains != nil?}
    C -- No --> Z[return original blocks]
    C -- Yes --> D[orderedDomainsForBlocks\nfilter to partition-local nodes]
    D --> E{all blocks have\ndomain entries?}
    E -- No --> Z
    E -- Yes --> F[groupSizeFromOrderedDomains\nsmallest 2^n x base ge maxAccelSize]
    F --> G[packOrderedDomainsIntoBlocks\nsplit + pad each domain group]
    G --> H{len out != len blocks?}
    H -- Yes --> I[expandedBaseBlockSlots\npad to next tree capacity]
    H -- No --> J[assignSequentialBlockIDs]
    I --> J
    J --> K{shouldUseComplementedBlocks?}
    K -- hasEmptySlots or len>input --> L[return complemented blocks]
    K -- No --> Z
    L --> M[refresh nodeInfo.blockID\nfor GetNodeTopologySpec]
Loading

Reviews (17): Last reviewed commit: "replace the temporary block tree used fo..." | Re-trigger Greptile

Comment thread pkg/translate/yaml.go
Comment thread pkg/translate/block_complement.go
Comment thread pkg/translate/topology.go Outdated
@ravisoundar ravisoundar force-pushed the rs-complement branch 3 times, most recently from 91ad1ef to c122256 Compare June 6, 2026 20:06
Comment thread pkg/translate/block_complement.go Outdated
@ravisoundar ravisoundar force-pushed the rs-complement branch 4 times, most recently from f048cb8 to eb4d16a Compare June 10, 2026 22:01
Comment thread pkg/translate/block.go
@ravisoundar ravisoundar force-pushed the rs-complement branch 4 times, most recently from dba4934 to 7f30438 Compare June 10, 2026 23:44
Signed-off-by: Ravi Shankar <ravish@nvidia.com>
@codecov

codecov Bot commented Jun 11, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 78.77358% with 45 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.08%. Comparing base (1875ab8) to head (f59beef).
⚠️ Report is 69 commits behind head on main.

Files with missing lines Patch % Lines
pkg/translate/block_pack.go 83.50% 8 Missing and 8 partials ⚠️
pkg/translate/block_complement.go 70.58% 3 Missing and 12 partials ⚠️
pkg/engines/slurm/slurm.go 74.07% 2 Missing and 5 partials ⚠️
pkg/translate/topology.go 50.00% 2 Missing and 1 partial ⚠️
pkg/engines/slinky/engine.go 33.33% 1 Missing and 1 partial ⚠️
pkg/topology/domain.go 81.81% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #343      +/-   ##
==========================================
+ Coverage   68.46%   71.08%   +2.62%     
==========================================
  Files          82       86       +4     
  Lines        4842     5219     +377     
==========================================
+ Hits         3315     3710     +395     
+ Misses       1395     1318      -77     
- Partials      132      191      +59     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@dmitsh dmitsh force-pushed the rs-complement branch 2 times, most recently from 0cdb20b to 8589f77 Compare June 11, 2026 20:53
…lat slot packing

Signed-off-by: Dmitry Shmulevich <dshmulevich@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants